For your group projects, you have probably used a tool like Google Drive or Dropbox to share your work with others. Although you could theoretically share your R files using these tools, they have a few limitations when it comes to version control. For example, how can you easily return to a previous version of your work? Or what happens when you and a friend overwrite each other’s code? In this tutorial, we’ll introduce you to Git and GitHub and teach you how to streamline your collaboration workflows!
Students are expected to hand in a digital copy of their answers and code according to the submission guidelines.
Every time you make a change in a Google Docs file, the system keeps track of your changes, and you can always return to an earlier version of your work. Git provides similar functionality specifically built for code and is therefore a popular version control system among data analysts and data scientists.
Git’s most important features are:
Nothing that is saved to Git is ever lost, so you can always go back to see which results were generated by which versions of your programs. You never have to worry about overwriting and losing your work. We know the struggle: assignment.docx, assignment-final.docx, assignment-final-FINAL.docx…
Workflows can be separated into so-called branches, which allow you to have multiple versions of your work and work on several things at once.
It’s highly useful in a team environment because everyone can work independently on the same files. Since there is a permanent record of who made each change, you can easily trace back each other’s contributions.
It automatically notifies you when your work conflicts with someone else’s. For example, say that for your group assignment you wrote “In this assignment, we explore...” in an R Markdown file, whereas your classmate, who was working on the assignment at the same time, wrote “The goal of this assignment is to...”. Git will inform you about a so-called merge conflict once you try to combine both versions of your work. In other words, it lets you compare the two versions side by side and makes sure that you don’t accidentally overwrite each other’s work.
One of the learning goals is defined as “Setup Git and Github”. So you may wonder: what’s the difference between them? After all, both names contain “Git”, right?
Git is a system for version control primarily used by programmers. It runs on the command line of your local machine (the Terminal on Mac and Git Bash on Windows). It allows you to keep track of your files and modifications to those files in something called a repository. Here’s an example of what working in Git may look like: a terminal that tells me that I made 3 modifications and added 1 folder (images/) that has not been tracked by Git yet. Don’t worry if you don’t fully understand this yet; we’ll go over each of these commands in a bit.
GitHub is a website that allows you to upload your Git repositories online. It provides a back-up of your files, has a visual interface for navigating your repositories on the internet, gives others a way to navigate your repositories, and makes collaboration easy. Although Git does not require the use of GitHub, it is very common to use both.
Below, you find a copy of this very file (version-control.Rmd) hosted on GitHub. It shows a folder structure just like the file explorer on Mac and Windows. You can click on any folder to see its contents. Repositories can be publicly available, but you can also choose to make a repository private, so that only you and the people you invite can see its contents (like Google Drive).
Let’s try it out!
Create a GitHub account (we recommend using your university email!), log in, and navigate to the course repository. Browse through the files and get a feeling for all the tabs and buttons (no worries: you can’t break anything :-). Try to get an answer for each of the questions:
In this step-by-step tutorial, we’re going to create a new repository and copy it to our local computer. Note that GitHub also offers a graphical tool to work with Git, yet we strongly recommend sticking to the command line, as it helps you to get a better understanding of how all concepts fit together.
Download and install the latest version of Git (Mac users first need to install Homebrew).
Head over to GitHub (the website) and click on the “+” symbol in the top right corner to create a new repository. Give it the name version-control-exercises, keep it public, and check the box that says add a README file:
The README.md file is a special file in the root directory of the repository. Everything in that file automatically renders on the home page of the repository. You should think of it as a way to communicate to others what’s in the repository and how they can run the code (e.g., see the example in the course repository). Currently, our README is still blank (except for the title), but we can easily change that directly from our browser. Click on the pencil icon in the top right corner of the README to edit the file and add a few lines to it. You can use the same formatting syntax as in R Markdown (e.g., bold, italics, etc.).
At the bottom of the page, you find two fields near the “Commit changes” subheader. The first field is required and asks you to give a very short description of the changes you have made. The second field is optional and can be used to provide some more details if you made a lot of changes. The idea here is that you can later go back to an earlier version, and these descriptions help you distinguish between the different versions of the file. Once you’re done, click on “Commit changes”. Congratulations! You have now created a repository, including a README file, and created your first commit on GitHub.
Next, run git clone <URL repo> in your terminal, where <URL repo> is the HTTPS URL (not the SSH one!) that you can find in your repository. In case Git asks for your credentials, enter your GitHub username and a personal access token (GitHub no longer accepts your account password for Git operations over HTTPS). Git has now copied the repo version-control-exercises to your local machine.
Note that in this exercise, you have been working with a pre-existing repository (you first created it on GitHub and then cloned it to your local machine). If you, however, want to create a repository for a project that already exists on your machine (e.g., a project you were already working on but did not use Git for), you can simply run git init inside the folder you want to be the root directory (or git init <folder> from its parent directory) to initialize a new Git repository.
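As a minimal sketch of that second scenario, the commands below turn an existing project folder into a Git repository. The folder name my-project and the file analysis.R are hypothetical, and everything happens in a temporary directory so nothing on your machine is touched.

```shell
# Work in a throwaway directory so the example leaves no traces.
workdir=$(mktemp -d)
cd "$workdir"

# A hypothetical project folder that was not under version control yet.
mkdir my-project
echo 'print("hello")' > my-project/analysis.R

# Initialize a repository inside the existing folder.
cd my-project
git init -q

# Git stores its bookkeeping in a hidden .git directory.
ls -d .git

# The existing file shows up as untracked (??) until you add it.
git status --short
```

The existing files are not tracked automatically: you still need to add and commit them yourself.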
Git does not track files by default. Instead, it waits until you explicitly tell it to pay attention to a file. This also means that the untracked files won’t benefit from version control. The git status command gives you an overview of files in your repository that aren’t yet being tracked.
First, we add files from the working directory to the staging area, a temporary holding place for changes that you want to include in your next commit. To add a file to the staging area, use git add <filename> (or git add . if you want to stage all files in the working directory). However, you don’t have to put all of the changes you have made recently into the staging area at once. For example, suppose you are adding a feature to data_preparation.R and spot a bug in data_analysis.R. After you have fixed it, you want to save your work. Since the changes to data_preparation.R aren’t directly related to the fix in data_analysis.R, you should save your work in two separate commits.
Second, to save the changes in the staging area, you use the command git commit. When you commit changes, Git requires you to enter a commit message (like you previously did for the README.md file!), for example git commit -m "add function to clean trailing spaces". The message serves the same purpose as a comment in a program: it tells the next person who examines the repository why you made a change.
If you run git commit without -m "your commit message", Git launches a text editor with a template like this:
# Please enter the commit message for your changes. Lines starting
# with '#' will be ignored, and an empty message aborts the commit.
# On branch master
# Your branch is up-to-date with 'origin/master'.
#
# Changes to be committed:
# new file / modified: <some_file>
#
As the template states, the lines starting with # will be ignored. If the editor that opens is Vim (a common default), press i to change to insert mode, enter your commit message, press Esc, and type :wq followed by Enter to save and close the file. A much simpler approach, therefore, is to pass your commit message directly with git commit -m.
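The stage-then-commit workflow with two separate commits could look roughly like the sketch below. The file contents, the commit messages, and the name and email address are placeholders chosen for illustration; the repository lives in a temporary directory.

```shell
# Throwaway repository for the demonstration.
workdir=$(mktemp -d)
cd "$workdir"
git init -q
# Identity for this throwaway repo only (placeholder values).
git config user.name "Student"
git config user.email "student@example.com"

# Two hypothetical scripts with unrelated changes.
echo "# new feature" > data_preparation.R
echo "# bug fix" > data_analysis.R

# Commit 1: stage and commit only the bug fix.
git add data_analysis.R
git commit -q -m "fix bug in data_analysis.R"

# Commit 2: stage and commit the new feature.
git add data_preparation.R
git commit -q -m "add feature to data_preparation.R"

# Number of commits so far, followed by the (now empty) list of pending changes.
git rev-list --count HEAD
git status --short
```

Keeping unrelated changes in separate commits makes it easier to find, and if necessary undo, one of them later.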
Let’s try it out!
cd into the version-control-exercises directory and run git status. If done correctly, you should get a message that there is nothing to commit.
Exercise 2
1. Open up Notepad (Windows) or TextEdit (Mac), create a new text file my_answers.txt, and save it into version-control-exercises. Run git status again, what does it say?
2. Add my_answers.txt to the staging area and commit the file with an appropriate message (e.g., add answers file).
3. Look up your version-control-exercises repository on Github and check whether my_answers.txt is there. Why is that?
To get our GitHub repo up to date, we need to push the changes we have made locally to a remote repository. For this, we use the command git push origin main, which means that we want to push the main branch to the remote called origin (more details on branches follow below!).
Recall that the remote repository is often a repository in an online hosting service like GitHub. A typical workflow is that you pull in your collaborators’ work from the remote repository so you have the latest version of everything, do some work yourself, then push your work back to the remote so that your collaborators have access to it. Pulling changes is straightforward: the command git pull origin main checks the remote repository for updates and merges them into the copy on your local machine. Keep in mind that Git stops you from pulling in changes when doing so might overwrite the things you have done locally. The fix is simple: first commit your local changes and then try to pull again.
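To see pushing and pulling without needing a GitHub account, here is a small sketch in which a local bare repository (remote.git) stands in for GitHub. That stand-in, the folder names, and the placeholder identity are assumptions made for the example; on GitHub you would clone the HTTPS URL instead.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# A bare repository plays the role of the GitHub remote.
git init -q --bare remote.git

# Clone it, just like you cloned your GitHub repo
# (the warning about cloning an empty repository is hidden).
git clone -q remote.git local 2>/dev/null
cd local
git config user.name "Student"
git config user.email "student@example.com"
git checkout -q -b main

# Do some work and commit it locally.
echo "answer 1" > my_answers.txt
git add my_answers.txt
git commit -q -m "add answers file"

# Push the main branch to the remote called origin.
git push -q origin main

# Pull to fetch and merge anything new from the remote (nothing new yet).
git pull -q origin main

# The remote now points to the same latest commit as the local repo.
git rev-parse HEAD
git --git-dir=../remote.git rev-parse main
```

The last two commands print the same hash, which shows that the local repository and the remote are in sync.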
Let’s try it out!
Run git push origin main and see whether my_answers.txt now appears in your online GitHub repository (note that it may ask you to enter your GitHub credentials). What happens if you run the command a second time?
Exercise 3
The goal of this exercise is to illustrate why it is recommended to regularly pull for changes.
Head over to GitHub, open the my_answers.txt file, click on the pencil icon, and add a few lines.
Commit your change on GitHub with the message “add data to my_answers.txt”.
Open the my_answers.txt file on your local machine and add some other data there (don’t forget to save the text file!).
Run git status; it should say that you modified my_answers.txt. Therefore, add the file to the staging area (git add my_answers.txt), and commit your changes (git commit -m "added some other data to my_answers.txt locally").
So far so good, but what happens once you try to push your changes to GitHub? Indeed, you get a message that updates were rejected because the remote contains work that you do not have locally. That is to say, the changes you made to my_answers.txt on GitHub are not yet available on your local machine. To fix this problem, you simply need to update your local repo with what has already happened in your GitHub repo using git pull origin main. Git asks you to enter a commit message to explain why the merge of both changes (local and GitHub) is necessary. To that end, it opens a text editor in the terminal (like you saw before in section 1.4). Press i to insert text (the lines with # will be ignored). Once you’re done, press Esc and type :wq followed by Enter to close the editor. Then, try to push your changes again (git push origin main) and refresh the repo on GitHub. Did it work this time?
So the key takeaway of this exercise should be to pull in changes frequently so that you don’t run into problems once you push your work. This especially holds when you’re working on a project with other collaborators.
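The rejected push from this exercise can be reproduced locally with the sketch below. It rests on a few assumptions: a bare repository (remote.git) stands in for GitHub, a second clone (website) stands in for the edit made in the browser, and that edit touches a different, hypothetical file (notes.txt) so that the merge succeeds without a conflict. The options --no-rebase and --no-edit tell Git to merge and to accept the default merge message instead of opening an editor; recent Git versions may refuse a plain git pull on diverged branches until you choose such a strategy.

```shell
workdir=$(mktemp -d)
cd "$workdir"
git init -q --bare remote.git

# First clone: your local machine.
git clone -q remote.git local 2>/dev/null
cd local
git config user.name "Student"
git config user.email "student@example.com"
git checkout -q -b main
echo "answer 1" > my_answers.txt
git add my_answers.txt
git commit -q -m "add answers file"
git push -q origin main
cd ..

# Second clone: stands in for the edit made on the GitHub website.
git clone -q --branch main remote.git website
cd website
git config user.name "Website"
git config user.email "website@example.com"
echo "notes" > notes.txt
git add notes.txt
git commit -q -m "add notes on GitHub"
git push -q origin main
cd ../local

# Meanwhile, you commit something else locally.
echo "answer 2" >> my_answers.txt
git commit -q -am "add some other data locally"

# This push is rejected: the remote has a commit you do not have yet.
if git push -q origin main 2>/dev/null; then
  push_result="accepted"
else
  push_result="rejected"
fi
echo "first push: $push_result"

# Pull (merge without opening an editor), then push again.
git pull -q --no-rebase --no-edit origin main
git push -q origin main

# History now holds both your commits, the website commit, and the merge commit.
git --no-pager log --oneline
```

If both sides had edited the same lines of the same file, the pull would stop with a merge conflict that you have to resolve by hand first.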
…
Git is a highly sophisticated tool with many bells and whistles, even for the most demanding “power users”. That’s why you’re often bombarded with Git commands if you search for them online. Here we highlight a couple of Git’s more advanced features which we think are important to know.
Do you recall the Google Docs example we discussed earlier? Git has a similar version history feature built in or, to be more precise, an overview of all past commits. In the top right corner of your GitHub repository you find the following:
Once you click on 11 commits here, it shows a chronologically ordered list of all historic commit messages, the author, and a unique ID (e.g., 0ae104b). You can click on each of these commits to get a side-by-side comparison of the changes (red = deleted, green = added).
Let’s try it out!
When was the first commit on the dev branch? What did Hannes change? How many lines of code did it involve?
We can access the same list of commits through our terminal. cd into the respective Git directory and run git log to view the log of the project’s history. Every time you press the space bar, it moves down the list (so you see older commit messages). Log entries are shown most recent first, and look like this:
commit 9dff4d8e3d7b5aba0082b1c6e115a9e7ffd4339c
Author: Hannes Datta <h.datta@tilburguniversity.edu>
Date: Mon Jan 18 22:03:52 2021 +0100
polish tutorial pages
The long string of characters on the first line is called a hash (or SHA). This hash is normally written as a 40-character hexadecimal string like 9dff4d8e3d7b5aba0082b1c6e115a9e7ffd4339c, but most of the time, you only have to give Git the first few characters (e.g., the first 7 or 8) in order to identify the commit you mean.
Let’s try it out!
Run git log in your own repository. What is the hash of your first commit? If there are many commit messages, you can exit the log history by pressing q (quit).
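If you want to experiment with the log before looking at your own project, the sketch below builds a tiny throwaway repository with two commits and inspects its history. The file name, the commit messages, and the identity are placeholders.

```shell
workdir=$(mktemp -d)
cd "$workdir"
git init -q
git config user.name "Student"
git config user.email "student@example.com"

# Create two commits so that there is some history to look at.
echo "v1" > my_answers.txt
git add my_answers.txt
git commit -q -m "add answers file"
echo "v2" >> my_answers.txt
git commit -q -am "extend answers"

# One line per commit, most recent first: abbreviated hash plus message.
git --no-pager log --oneline

# Full 40-character hash of the latest commit, and its abbreviation.
full_hash=$(git rev-parse HEAD)
short_hash=$(git rev-parse --short HEAD)
echo "$full_hash"
echo "$short_hash"

# Git accepts the abbreviated hash wherever it accepts the full one.
git --no-pager show --no-patch --format=%s "$short_hash"
```

The --no-pager option prints the log directly instead of opening the scrollable view, which is handy for short histories.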
Exercise 4
Suppose that we accidentally removed the my_answers.txt file and want to restore it in our repository. This exercise shows you how to do that.
1. Remove the my_answers.txt file from version-control-exercises (trust us, you can even empty your trash can!)
2. Look up the hash of the commit in which the file was still present (git log) and copy the SHA (Cmd + C / Ctrl + C).
3. Run git checkout <SHA> and inspect your directory. Is the my_answers.txt file there, again?
If you followed the steps above, you probably got the message below. It means that Git has checked out that specific commit instead of a branch (a so-called detached HEAD), while our original work, from before running git checkout, still resides on its own branch. In the next section, we explain the applications of branches and how you can switch between them.
Note: switching to '<SHA>'.
You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.
If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:
git switch -c <new-branch-name>
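The sketch below walks through visiting an old commit and returning to the branch afterwards. It assumes that the deletion was committed first: an uncommitted deletion counts as a local change that Git keeps when you switch commits, so in that situation git checkout <SHA> -- my_answers.txt is the command that brings the file back. The branch name main and the identity are placeholders.

```shell
workdir=$(mktemp -d)
cd "$workdir"
git init -q
git config user.name "Student"
git config user.email "student@example.com"
git checkout -q -b main

echo "my answers" > my_answers.txt
git add my_answers.txt
git commit -q -m "add answers file"
old_commit=$(git rev-parse HEAD)

# Delete the file and commit the deletion.
git rm -q my_answers.txt
git commit -q -m "remove answers file"

# Visit the old commit (detached HEAD): the file is back in the folder.
git checkout -q "$old_commit"
ls my_answers.txt

# Return to the branch: the file is gone again.
git checkout -q main
ls

# Bring just that one file back from the old commit and commit it.
git checkout "$old_commit" -- my_answers.txt
git commit -q -m "restore answers file"
cat my_answers.txt
```

Restoring a single file this way keeps you on your branch, so you never end up in the detached HEAD state.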
Undo changes to unstaged files
Suppose you have made changes to a file, then decide you want to undo them. Your text editor may be able to do this, but a more reliable way is to let Git do the work. The command:
git checkout -- filename
will discard the changes that have not yet been staged. (The double dash -- must be there to separate the git checkout command from the names of the file or files you want to recover.)
Use this command carefully: once you discard changes in this way, they are gone forever.
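A minimal sketch of discarding an unstaged change, using a hypothetical file report.txt in a throwaway repository:

```shell
workdir=$(mktemp -d)
cd "$workdir"
git init -q
git config user.name "Student"
git config user.email "student@example.com"

echo "original" > report.txt
git add report.txt
git commit -q -m "add report"

# Make an unwanted change without staging it.
echo "oops" >> report.txt
git status --short

# Discard the unstaged change; the file returns to its last committed version.
git checkout -- report.txt
cat report.txt
git status --short
```

After the checkout, git status reports nothing and the extra line is gone for good.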
Undo changes to staged files
At the start of this chapter you saw that git reset will unstage files that you previously staged using git add. By combining git reset with git checkout, you can also undo changes to a file that you have already staged. The syntax is as follows.
git reset HEAD path/to/file
git checkout -- path/to/file
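Applied to a hypothetical file report.txt in a throwaway repository, the two steps might look like this:

```shell
workdir=$(mktemp -d)
cd "$workdir"
git init -q
git config user.name "Student"
git config user.email "student@example.com"

echo "original" > report.txt
git add report.txt
git commit -q -m "add report"

# Make an unwanted change and stage it.
echo "oops" >> report.txt
git add report.txt

# Step 1: unstage the change (the file on disk still contains it).
git reset -q HEAD report.txt
git status --short

# Step 2: discard the now-unstaged change.
git checkout -- report.txt
cat report.txt
```

After step 1 the change is still on disk but no longer staged; step 2 then removes it from the file itself.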
Branches allow you to have multiple versions of your work, and let you track each version systematically. Each branch is like a parallel universe: changes you make in one branch do not affect other branches (until you merge them back together).
Every Git repository has a default branch. On GitHub it is called main; older repositories and many local Git installations call it master. To list all of the branches in a repository, you can run the command git branch. The branch you are currently in will be shown with a * beside its name.
You previously used git checkout with a commit hash to switch the repository state to that hash. You can also use git checkout with the name of a branch to switch to that branch (this only works if all changes on the current branch have been committed).
Exercise:
* Switch to a new branch.
* Delete something on that branch.
* Commit.
* Go back to the old branch (the file is still there).
The most common thing you want to do is to create a branch and then switch to that branch.
In the previous exercise, you used git checkout branch-name to switch to a branch. To create a branch and then switch to it in one step, you add a -b flag, calling git checkout -b branch-name. The contents of the new branch are initially identical to the contents of the original. Once you start making changes, they only affect the new branch.
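The “parallel universe” idea can be tried out with the short sketch below. The branch names main and update, the file todo.txt, and the identity are chosen for illustration only.

```shell
workdir=$(mktemp -d)
cd "$workdir"
git init -q
git config user.name "Student"
git config user.email "student@example.com"
git checkout -q -b main

echo "A) Write report." > todo.txt
git add todo.txt
git commit -q -m "add todo list"

# Create a branch called update and switch to it in one step.
git checkout -q -b update
echo "B) Submit report." >> todo.txt
git commit -q -am "add submit step"

# List branches; the current one is marked with *.
git branch

# Back on main, the file does not contain the new line.
git checkout -q main
cat todo.txt
```

The change made on update only becomes visible on main once the two branches are merged.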
Merging branches
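To merge two branches, first switch to the destination branch (git checkout destination) and then run git merge source. Git always merges the named branch into the branch you are currently on, so it does matter which branch you are on at that moment! If a merge commit is needed, Git automatically opens an editor so that you can write a log message for the merge; you can either keep its default message or fill in something more informative.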
The file todo.txt initially contains these two lines:
A) Write report.
B) Submit report.
You create a branch called update and modify the file to be:
A) Write report.
B) Submit final version.
C) Submit expenses.
You then switch back to the master branch and delete the first line, so that the file contains:
B) Submit report.
The deletion of line A and the addition of line C do not contradict each other, but line B is a problem: the two branches now disagree about this part of the file, so Git reports a merge conflict and asks you to resolve it by hand.
Inside the file, Git leaves markers that look like this to tell you where the conflicts occurred:
<<<<<<< HEAD
...changes from the destination branch (the branch you are on)...
=======
...changes from the source branch...
>>>>>>> source-branch-name
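You can open text files in the terminal with nano filename, for example to edit a file that contains conflict markers.
Ctrl-K: delete a line. Ctrl-U: un-delete a line. Ctrl-O: save the file (‘O’ stands for ‘output’). Ctrl-X: exit the editor.
A typical sequence is Ctrl-O (save), Enter (confirm the file name), and then Ctrl-X (to leave). On a Mac, nano also uses the Control key, not Command.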
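To watch a merge conflict happen from start to finish, the sketch below changes the same line of todo.txt differently on two branches, merges them, and then resolves the conflict. The alternative text “Submit draft.”, the branch name main, and the identity are assumptions for this example, and the file is overwritten with printf instead of being edited in nano so that the steps can run unattended.

```shell
workdir=$(mktemp -d)
cd "$workdir"
git init -q
git config user.name "Student"
git config user.email "student@example.com"
git checkout -q -b main

printf 'A) Write report.\nB) Submit report.\n' > todo.txt
git add todo.txt
git commit -q -m "add todo list"

# On branch update, change line B one way.
git checkout -q -b update
printf 'A) Write report.\nB) Submit final version.\n' > todo.txt
git commit -q -am "update: submit final version"

# On main, change the same line another way.
git checkout -q main
printf 'A) Write report.\nB) Submit draft.\n' > todo.txt
git commit -q -am "main: submit draft"

# Merge update into the current branch (main); Git reports a conflict.
git merge update || echo "merge stopped because of a conflict"

# The file now contains the conflict markers.
cat todo.txt

# Resolve by writing the version you want, then stage and commit.
printf 'A) Write report.\nB) Submit final version.\n' > todo.txt
git add todo.txt
git commit -q -m "merge update into main"
```

Staging the edited file is how you tell Git that the conflict in that file has been resolved.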
Data analysis often produces temporary or intermediate files that you don’t want to save. You can tell Git to stop paying attention to files you don’t care about by creating a file called .gitignore in the root directory of your repository and storing in it a list of wildcard patterns that specify the files you want Git to ignore. For example, if .gitignore contains:
build
*.mpl
then Git will ignore any file or directory called build (and, if it’s a directory, anything in it), as well as any file whose name ends in .mpl.
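You can check the effect of these two patterns yourself with the sketch below. The file names build/output.txt, figure.mpl, and analysis.R are hypothetical examples.

```shell
workdir=$(mktemp -d)
cd "$workdir"
git init -q

# Ignore anything called build and any file ending in .mpl.
printf 'build\n*.mpl\n' > .gitignore

mkdir build
echo "temp" > build/output.txt
echo "temp" > figure.mpl
echo "keep" > analysis.R

# Only .gitignore and analysis.R are reported as untracked.
git status --short

# git check-ignore prints the paths that match an ignore pattern.
git check-ignore build/output.txt figure.mpl
```

It is common to commit the .gitignore file itself, so that collaborators ignore the same files.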
Let’s try it out!
XX
https://www.dataschool.io/simple-guide-to-forks-in-github-and-git/
The repository in this example is a Markdown document linking to good resources for learning data science topics. To fork it, you first need to be logged in to your GitHub account; then click on Fork. This makes a copy of the repo in your own account, which includes all repository files; even the commit history is preserved. Your copy also contains a link to the original repo that you forked. Once you fork a repo, it does not automatically stay in sync with the original repo.
You fork a repo because you want a copy of the files or because you intend to contribute to that repo:
1. Fork the repo.
2. Make a modification to your fork of the repo.
3. Send a pull request to the repo owner asking them to pull your changes into their repo.
If you want a copy of a GitHub repo, you just fork it, and then you can clone it to your local computer. After making changes and committing these changes, you can push the changes back up to GitHub. Finally, you make a pull request to ask the owner of the original repo to pull your changes into their repo.
Exercise X
https://github.com/datasciencemasters/go
XXX
E.g., Make a contribution to marketing-tools (make it relevant to marketing students!) …
| Command | Use |
|---|---|
| `git clone <URL>` | Makes a clone of the repository at the specified URL |
| `git status` | Outputs status, including what branch you are on and what changes are staged |
| `git add .` | Adds all changes to the staging area to be committed |
| `git add <file_name>` | Adds changes to the specified file to the staging area to be committed |
| `git commit` | Commits staged changes and allows you to write a commit message |
| `git push origin <branch_name>` | Pushes local changes to the specified branch of the online repository |
| `git pull origin <branch_name>` | Pulls changes from the online repository into the local repository |
| `git log` | Outputs a log of past commits with their commit messages |
| `git checkout -b <branch_name>` | Creates and switches to a new branch |
| `git checkout <branch_name>` | Switches to the specified branch |
| `git merge <branch_name>` | Merges the specified branch into the branch you are on |